Abstract

We present experimental evidence that 
providing naive users of a spoken dia- 
logue system with immediate help mes- 
sages related to their out-of-coverage ut- 
terances improves their success in using 
the system. A grammar-based recog- 
nizer and a Statistical Language Model 
(SLM) recognizer are run simultaneously.
If the grammar-based recognizer succeeds,
the less accurate SLM recognizer
hypothesis is not used. When the
grammar-based recognizer fails and the 
SLM recognizer produces a recognition 
hypothesis, this result is used by the Tar- 
geted Help agent to give the user feed- 
back on what was recognized, a diag- 
nosis of what was problematic about the 
utterance, and a related in-coverage ex- 
ample. The in-coverage example is in- 
tended to encourage alignment between 
user inputs and the language model of 
the system. We report on controlled experiments
on a spoken dialogue system
for command and control of a simulated 
robotic helicopter. 

1 Introduction 

Targeted Help makes use of user utterances that 
are out-of-coverage of the main dialogue sys- 
tem recognizer to provide the user with immedi- 
ate feedback, tailored to what the user said, for 
cases in which the system was not able to un- 
derstand their utterance. These messages can be 
much more informative than responding to the 
user with some variant of “Sorry I didn’t understand”,
which is the behaviour of most current dialogue
systems. To implement Targeted Help we
use two recognizers: the Primary Recognizer is
constructed with a grammar-based language model
and the Secondary Recognizer used by the Targeted
Help module is constructed with a Statistical
Language Model (SLM). As part of a spoken
dialogue system, grammar-based recognizers
tuned to a domain perform very well, in fact
better than comparable Statistical Language Models
(SLMs) for in-coverage utterances (Knight et
al., 2001). However, in practice users will some- 
times produce utterances that are out of coverage. 
This is particularly true of non-expert users, who 
do not understand the limitations and capabilities 
of the system, and consequently produce a much 
lower percentage of in-coverage utterances than expert
users. The Targeted Help strategy for achieving
good performance with a dialogue system is to
use a grammar-based language model and assist 
users in becoming expert as quickly as possible. 
This approach takes advantage of the strengths of 
both types of language models by using the grammar-based
model for in-coverage utterances and
the SLM as part of the Targeted Help system for 
out-of-coverage utterances. 

In this paper we report on controlled experi- 
ments, testing the effectiveness of an implementa- 
tion of Targeted Help in a mixed initiative dialogue 
system to control a simulated robotic helicopter. 

2 System Description 

2.1 The WITAS Dialogue System 

Targeted Help was deployed and tested as part 
of the WITAS dialogue system 1 , a command and 
control and mixed-initiative dialogue system for 
interacting with a simulated robotic helicopter or 
UAV (Unmanned Aerial Vehicle) (?). The dia- 
logue system is implemented as a suite of agents 
communicating through the SRI Open Agent Architecture
(OAA) (Martin et al., 1998). The
agents include: Nuance Communications Recog- 
nizer (Nuance, 2002); the Gemini parser and gen- 
erator (Dowding et al., 1993) (both using a gram- 
mar designed for the UAV application); Festival 
text-to-speech synthesizer (Systems, 2001); a GUI 
which displays a map of the area of operation 
and shows the UAV’s location; the Dialogue Man- 
ager (?); the Robot Control and Report compo- 
nent, which translates commands and queries bi- 
directionally between the dialogue interface and 
the UAV. The Dialogue Manager interleaves mul- 
tiple planning and execution dialogue threads (?). 

1 See http://www.ida.liu.se/ext/witas and
http://www-csli.stanford.edu/semlab/witas

While the helicopter is airborne, an on-board
active vision system will analyze the scene below
to interpret ongoing events, which may be reported
(via NL generation) to the operator. The
robot can carry out various activities such as fly- 
ing to a location, fighting fires, following a ve- 
hicle, and landing. Interaction in WITAS thus 
involves joint activities between an autonomous
system and a human operator. These are activ- 
ities which the autonomous system cannot com- 
plete alone, but which require some human inter- 
vention (e.g. search for a vehicle). These activi- 
ties are specified by the user during dialogue, or 
can be initiated by the UAV. In any case, a major 
component of the dialogue, and a way of maintain- 
ing its coherence, is tracking the state of current 
or planned activities of the robot. This system is 
sufficiently complex to serve as a good testbed for 
Targeted Help. 

2.2 The Targeted Help Module 

The Targeted Help Module is a separate component
that can be added to an appropriately structured
dialogue system with minimal changes to accommodate
the specifics of the domain. This modular
design makes it quite portable, and a version
of this agent is in fact being used in a second com- 
mand and control dialogue system (Hockey et al., 
2002; ?). It is argued in (?) that “low-level” 
processing components such as the Targeted Help 
module are an important focus for future dialogue 
system research. Figure 1 shows the structure of 
the Targeted Help component and its relationship 
to the rest of the dialogue system. 

The goal of the Targeted Help system is to han- 
dle utterances that cannot be processed by the 
usual components of the dialogue system, and to 
align the user’s inputs with the coverage of the sys- 
tem as much as possible. To perform this function 
the Targeted Help component must be able to de- 
termine which utterances to handle, and then con- 
struct help messages related to those utterances, 
which are then passed to a speech synthesizer. The 
module consists of three parts: 

• the Secondary Recognizer, 

• the Targeted Help Activator, 

• the Targeted Help Agent. 





Figure 1: Architecture 


The Targeted Help Activator takes input from 
both the main grammar-based recognizer and the 
backup category-based SLM recognizer. It uses 
this input to determine when the Targeted Help 
component should produce a message. The Acti- 
vator’s behavior is as follows for the four possible 
combinations of recognizer outcomes: 

1. Both recognizers get a recognition hypothesis:

Targeted Help remains inactive; normal dialogue
system processing proceeds.

2. Main recognizer gets a recognition hypothesis
and secondary recognizer rejects:

Targeted Help remains inactive; normal dialogue
system processing proceeds.

3. Main recognizer rejects, secondary recognizer
gets a recognition hypothesis:

Targeted Help is activated.

4. Both recognizers reject:

Targeted Help is not activated; the default system
failure message is produced.
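
The four cases above reduce to a simple decision rule. The following Python fragment is a minimal sketch of that rule, not the actual Activator implementation; the names `Action` and `activator_decision` are hypothetical, and a rejection is represented here as `None`.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NORMAL_PROCESSING = "normal dialogue system processing"
    TARGETED_HELP = "targeted help message"
    DEFAULT_FAILURE = "default failure message"


def activator_decision(primary: Optional[str], secondary: Optional[str]) -> Action:
    """Map the two recognizer outcomes to an action.

    Assumption: each argument is the recognition hypothesis as a string,
    or None if that recognizer rejected the utterance.
    """
    if primary is not None:
        # Cases 1 and 2: the grammar-based result wins, whatever the SLM says.
        return Action.NORMAL_PROCESSING
    if secondary is not None:
        # Case 3: only the SLM recognizer produced a hypothesis.
        return Action.TARGETED_HELP
    # Case 4: both recognizers rejected.
    return Action.DEFAULT_FAILURE
```

Note that the secondary hypothesis is consulted only when the primary recognizer rejects, which matches the statement that the less accurate SLM result is never used when the grammar-based recognizer succeeds.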

Once Targeted Help is activated, the Targeted 
Help Agent constructs a message based on the 
recognition hypothesis from the secondary SLM 
recognizer. These messages are composed of one 
or more of the following pieces: 


What the system heard: a report of the backup 
SLM recognition hypothesis. 

What the problem was: a description of the 
problem with the user’s utterance (e.g. the 
system doesn’t know a word); and 

What you might say instead: A similar in- 
coverage example. 
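
As an illustration of how the three pieces might be assembled, here is a small Python sketch. The function name `compose_help_message` and the exact sentence templates are assumptions, modelled on the sample help message quoted in Section 3 rather than taken from the system itself.

```python
from typing import Optional


def compose_help_message(heard: Optional[str] = None,
                         problem: Optional[str] = None,
                         example: Optional[str] = None) -> str:
    """Assemble a help message from whichever pieces are available.

    Assumption: the templates below are illustrative wording only.
    """
    sentences = []
    if heard and problem:
        sentences.append(f"The system heard {heard}, unfortunately {problem}.")
    elif heard:
        sentences.append(f"The system heard {heard}.")
    elif problem:
        sentences.append(f"Unfortunately {problem}.")
    if example:
        sentences.append(f"You could try saying {example}.")
    return " ".join(sentences)
```

Because each piece is optional, the same function covers messages that contain only a report of what was heard, only an example, or all three parts.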

In constructing both the diagnostic of the prob- 
lem with the utterance, and the in-coverage exam- 
ple, we are faced with the question of whether the 
information from the secondary recognizer is suf- 
ficient to produce useful help messages. Since this 
domain is relatively novel, there is not very much 
data for training the SLM and the performance re- 
flects this. We have designed a rule based system 
that looks for patterns in the recognition hypothe- 
sis that seem to be detected adequately even with 
incomplete or inaccurate recognition. 

Diagnostics are of three major types: 

• endpointing errors, 

• unknown vocabulary, 

• subcategorization mistakes. 

We found from an analysis of transcripts that 
these three types of errors accounted for the majority
of failed utterances. Endpointing errors are
cases of one or the other end of an utterance being
cut off, for example when the user says “search
for the red car” but the system hears “for the red
car”. We use information from the dialogue system’s
parsing grammar (which has identical coverage
to its speech recognizer) to determine whether
the initial word recognized for an utterance is a 
valid initial word. If not, the utterance is diag- 
nosed as a case of the user pressing the push-to- 
talk button too late and the system reports that to 
the user. Out-of-vocabulary items that can be identified
by Targeted Help are those that are in the
SLM’s vocabulary but are out of coverage for the
grammar-based recognizer and so cannot be processed
by the dialogue system. For these items
Targeted Help produces a message of the form 
“the system doesn’t understand the word X”. 
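
The endpointing and unknown-vocabulary checks described above can be sketched as two set lookups. This is a hedged illustration only: the word lists are tiny hypothetical stand-ins for information that the real system derives from its parsing grammar, the message wording for the endpointing case is invented, and the order of the two checks is an assumption.

```python
from typing import Optional

# Hypothetical stand-ins for information extracted from the grammar.
VALID_INITIAL_WORDS = {"fly", "search", "land", "zoom", "follow", "where", "what"}
GRAMMAR_VOCABULARY = VALID_INITIAL_WORDS | {
    "for", "the", "red", "car", "to", "hospital", "in", "at", "school",
}


def diagnose(hypothesis: str) -> Optional[str]:
    """Return a diagnostic string for the SLM hypothesis, or None."""
    words = hypothesis.lower().split()
    if not words:
        return None
    # Endpointing: the first recognized word cannot start an utterance.
    if words[0] not in VALID_INITIAL_WORDS:
        return "the push-to-talk button may have been pressed too late"
    # Unknown vocabulary: in the SLM vocabulary but not in the grammar.
    unknown = [w for w in words if w not in GRAMMAR_VOCABULARY]
    if unknown:
        return f"the system doesn't understand the word {unknown[0]}"
    return None
```

With these toy word lists, `diagnose("for the red car")` yields the endpointing diagnostic and `diagnose("fly over to the hospital")` reports the word "over".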

Saying “Zoom in on the red car” when the sys- 
tem only has intransitive “zoom in” is an exam- 
ple of a subcategorization error. In these cases the 
word is in-vocabulary but has been used in a way 
that is out-of-grammar. To diagnose subcatego- 
rization errors we consult the recognition/parsing 
grammar for subcategorization information on in- 
vocabulary verbs in the secondary recognizer hy- 
pothesis, then check what else was recognized to 
determine if the right arguments are there. For 
these types of errors the system produces a mes- 
sage such as “the system doesn’t understand the 
word X used with the red car”. These diagnostics 
are one significant difference from the approach 
used in (Gorrell et al., 2002). The simple classi- 
fier approach used in that work to select example 
sentences would not support these types of diag- 
nostics. 
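
A subcategorization check of this kind can be sketched as a lookup of the verb's permitted frames followed by a comparison with the remaining recognized words. The frame table `SUBCAT_FRAMES` and the function name below are hypothetical; the real system consults its recognition/parsing grammar rather than a hand-written dictionary.

```python
from typing import Optional, Sequence

# Hypothetical frame table: each verb maps to the word sequences that may
# introduce its arguments. An empty tuple means "nothing may follow".
SUBCAT_FRAMES = {
    ("zoom", "in"): [()],      # intransitive only
    ("fly",): [("to",)],       # "fly to <place>"
}


def diagnose_subcategorization(words: Sequence[str]) -> Optional[str]:
    """Return a diagnostic if a known verb is used with unsupported arguments."""
    words = tuple(words)
    for verb, frames in SUBCAT_FRAMES.items():
        if words[:len(verb)] != verb:
            continue
        rest = words[len(verb):]
        for frame in frames:
            if not frame and not rest:
                return None  # intransitive use of an intransitive verb
            if frame and rest[:len(frame)] == frame and len(rest) > len(frame):
                return None  # arguments start with a permitted frame
        if not rest:
            return None  # nothing followed the verb, so nothing to report
        verb_text = " ".join(verb)
        rest_text = " ".join(rest)
        return f"the system doesn't understand the word {verb_text} used with {rest_text}"
    return None
```

For the example in the text, `diagnose_subcategorization("zoom in on the red car".split())` produces a message of the form quoted above, while "fly to the hospital" passes the check.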

In constructing examples that are similar to the 
user’s utterance one issue is in what sense they 
should be similar. One aspect we have looked 
at is using in-coverage words from the user’s ut- 
terance. It is likely to help naive users learn 
the coverage of the system if the examples give 
them valid uses of in-coverage words they pro- 
duced in their utterance. By using words from the 
user’s utterance the system provides both confir- 
mation that those words are in coverage and an in- 
coverage pattern to imitate. We believe that this 
leads to greater linguistic alignment between the 
user and the system. Another aspect of similarity
that we suspect is important is matching the
utterance dialogue-move type (e.g. wh-question,
yes/no-question, command); otherwise the user is
likely to be misled into thinking that a particular 
type of dialogue-move is impossible in the system. 

Looking for in-coverage words is fairly robust. 
Even when the user produces an out-of-coverage 
utterance they are likely to produce some in- 
coverage words. The Targeted Help agent looks 
for within-domain words in the recognition hy- 
pothesis from the secondary SLM recognizer. This 
gives us a set of target words from which to 
match the example to the dialogue-move type of 
the user’s utterance: wh-question, yn-question, an- 
swer, or command. 

Furthermore, for commands (which are a large 
percentage of the utterances) we use the in- 
coverage words to produce a targeted in-coverage 
example that is interpretable by the system. These 
examples are intended to demonstrate how in- 
vocabulary words from the backup recognizer hy- 
pothesis could be successfully used in commu- 
nicating with the system. For example, if the 
user says something like “fly over to the hospi- 
tal”, where “over” is out-of-coverage, and the fall- 
back recognizer detected the words “fly” and “hos- 
pital”, the Targeted Help agent could provide an 
in-coverage example like “fly to the hospital”. For 
the other less frequent utterance types we have one 
in-coverage example per type. The system currently
uses a look-up table, but we hope to incorporate
generation work which would support generation
of these examples on the fly from a list of
in-coverage words (?). 
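
The look-up strategy described in this section can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the example table, the first-word heuristic for guessing the dialogue-move type, and the word-overlap score are not taken from the system, and the "answer" move type is omitted for brevity.

```python
# Hypothetical look-up table of in-coverage examples per dialogue-move type.
EXAMPLES = {
    "command": ["fly to the hospital", "search for the red car", "land at the school"],
    "wh-question": ["where is the red car"],
    "yn-question": ["are you at the hospital"],
}
IN_COVERAGE_WORDS = {w for exs in EXAMPLES.values() for ex in exs for w in ex.split()}
WH_WORDS = {"what", "where", "which", "who"}
YN_STARTERS = {"are", "is", "can", "do", "did"}


def classify_move(words):
    """Guess the dialogue-move type from the first recognized word."""
    if words and words[0] in WH_WORDS:
        return "wh-question"
    if words and words[0] in YN_STARTERS:
        return "yn-question"
    return "command"


def pick_example(hypothesis: str) -> str:
    """Pick the example of the same move type sharing the most target words."""
    words = hypothesis.lower().split()
    targets = set(words) & IN_COVERAGE_WORDS
    candidates = EXAMPLES[classify_move(words)]
    return max(candidates, key=lambda ex: len(targets & set(ex.split())))
```

With this toy table, a hypothesis containing "fly" and "hospital" selects "fly to the hospital", mirroring the example given in the text.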

3 Design of Experiments 

In order to assess the effectiveness of the targeted 
help provided by our system, we compared the 
performance of two groups of users, one that re- 
ceived targeted help, and one that did not. Twenty 
members of the Stanford University community 
were randomly assigned to one of the two groups. 
There were both male and female subjects, the ma- 
jority of subjects were in their twenties and none 
of the subjects had prior experience with spoken 
dialogue systems. The structure of the interaction 
with the system was the same for both groups. 
They were given minimal written instruction on 
how to use the system before the interaction began.
They were then asked to use the system to
complete five tasks, in which they directed a helicopter
to move within a city environment to complete
various task-oriented goals, which were different
for four of the five tasks. For each task the
goals were given immediately prior to the start of 
the interaction, in language the system could not
process, to prevent users from simply reading the
goal aloud to the system. A given task ended when 
one of the following criteria was met: 

1. the task was accurately completed and the 
user indicated to the system that he or she had 
finished, 

2. the user believed that the task was completed 
and indicated this to the system when in fact 
the task was not accurately completed, or 

3. the user gave up. 

The first and last of the sequence of five tasks 
were the critical trials that were used to assess per- 
formance. Both of the tasks had goals of the form 
“locate an x and then land at the y”. The experiment
was conducted in a single session. An experimenter
was present throughout, but when asked
she refused to provide any feedback or hints about 
how to interact with the system. 

As stated above, the critical difference between 
the two groups of users was the feedback they re- 
ceived during interaction with the system. When 
the users in the No Help condition produced out- 
of-coverage utterances the system responded only 
with a text display of the message “not recog- 
nized”. In contrast, when users in the Help condi- 
tion produced out-of-coverage utterances they re- 
ceived in-depth feedback such as: “The system 
heard fly between the hospital and the school, un- 
fortunately it doesn’t understand fly when used 
with the words between the hospital and the 
school. You could try saying fly to the hospital.”

We hypothesized that: 1) providing Targeted
Help would improve users’ ability to complete
tasks (HIGHER TASK COMPLETION); and 2) time
to complete tasks would be reduced for users receiving
Targeted Help (REDUCED TIME). We
also anticipated that both effects would be more
marked in the first task than in the fifth task
(LARGER EARLY EFFECT).

4 Experimental Results 

We found clear evidence that targeted help improves
performance in this environment, as measured
by both the frequency with which users
explicitly gave up on a task, and the time
to complete the remaining tasks. In this section we
present the statistical analyses of the experiment. 
In the following analyses, two subjects, both in
the No Help condition, were excluded
because they gave up on every task, leaving
9 users in each of the two help conditions.
Exceptions are noted.

We begin by examining the percentage of trials 
in which users explicitly gave up on a task before 
it was completed. We compared the percentage of 
trials in which the user clicked the “give up” but- 
ton in both tasks for users in both help conditions. 
As predicted, a 1-within (Task), 1-between (Help
condition) subjects ANOVA revealed a main effect
of the help condition ($F_1(1,16) = 6.000$, $p < .05$).
Users who received targeted help were less likely 
to give up than those who did not receive help, par- 
ticularly during the first task (11% vs. 27%). If 
we include the two subjects in the No Help con- 
dition who gave up on every task the difference is 
even more striking. For the first task only 11% of
the users who received help gave up, compared to 
45% of the users who did not receive help. The 
pattern holds up even if we include the three in- 
tervening filler trials along with the experimental 
trials, as demonstrated by a paired t-test item anal- 
ysis (t(4) = 7.330, p<.05). Those who received 
help were less likely to explicitly give up even on 
this wider variety of tasks. 

We next examine the time it took users to com- 
plete the individual tasks. Here it is necessary to 
be clear about what is meant by “completion.” It 
is more ambiguous than it may seem. Each task 
had several sub-goals, and it was even difficult 
to objectively evaluate whether a single sub-goal
had been met. For instance, the goal of the first 
task was to find a red car near the warehouse and 
then land the helicopter. Users tended to indicate 
that they had finished as soon as they saw the red 
car, failing to land the helicopter as the instructions
specified. Another common source of ambiguity
was when the user saw the car on the map
but never brought it up in the dialogue, simply 
landing the helicopter and clicking “finished.” The 
problem with this is that there is no way of know- 
ing whether the user actually saw the car before 
clicking “finished,” and there was no explicit record
that they were aware of its presence. For all tri- 
als the experimenter evaluated the task comple- 
tion, recording what was done and what was left 
undone. According to the experimenter, in most 
cases of potential ambiguity the basic goal was 
completed. In a few instances, however, the user 
indicated belief that the task had been completed 
when it obviously had not. An example of this is 
the following: The goal specified was to find a red 
car near the warehouse and then land. The user 
flew the helicopter to the police station, and then 
clicked “finished,” ending the task. We dealt with 
the ambiguity problem by analyzing the time to 
completion data separately according to two dif- 
ferent inclusion criteria. In both cases the pattern 
was the same: Users who received help took less 
time to complete tasks than those who did not, the 
first task took longer to complete than the last one, 
and the difference between the help and no help 
conditions was more marked on the first task than 
on the last one. 

In the first analysis we included all trials in 
which the user clicked the “finished” button, re- 
gardless of their actual performance. Subjects who 
failed to complete one of the two critical tasks 
(tasks 1 and 5) were excluded from the analysis. 
We used a 1-within (Task), 1-between (Help condition)
subjects ANOVA. For task 1, 89% of the
trials in the Help condition and 55% of the trials
in the No Help condition were considered “completed.” For
task 5, 100% of the trials in the Help condition and
80% of the trials in the No Help condition were
considered “completed.” The analysis revealed a
marginally significant main effect of the help condition
($F_1(1,11) = 3.809$, $p < .1$), a main effect of
task ($F_1(1,11) = 62.545$, $p < .001$), and a help condition
by task interaction ($F_1(1,11) = 10.203$, $p < .05$).
The effects were in the predicted direction. Users 
who received help took less time to complete tasks 
than those who did not (290.4 seconds vs. 440.6 
seconds), the first task took longer to complete
than the last one (365.5 seconds vs. 220.4 seconds),
and the difference between the help and no
help conditions was more marked on the first task
than on the last one (150.2 seconds vs. 94 seconds).
Figure 2 shows these results.

[Plot: Lenient Criterion Analysis; tasks shown: Task 1 and Task 5]

Figure 2: Time to complete task under Lenient
Criterion for completion

One criticism of this analysis is that it may in- 
clude trials in which the task objectives were not 
accurately completed before the subject clicked 
“finished”. We wished to avoid experimenter sub- 
jectivity with respect to task completion, so we 
conducted another analysis using the strictest in- 
clusion criterion the experimental design allowed. 
In this analysis we included only those trials in 
which all task objectives were completed and 
could be verified using the transcripts. This meant 
that for all of the trials we included, the goal entity 
was explicitly mentioned in the dialogue. Accord- 
ing to this criterion only 44% of users in the Help 
condition and 18% of users in the No Help condition
completed the first task. Similarly, 89% of
users in the Help condition and 40% of users in the
No Help condition accurately completed the fifth task.
Although this analysis is conducted on sparse data, 
it provides strong supporting evidence for the data 
pattern observed in the more lenient analysis. 
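
The two inclusion criteria can be stated as simple filters over per-trial records. The sketch below is illustrative only: the `Trial` record and its field names are hypothetical, and it assumes that "verified using the transcripts" amounts to the goal entity being mentioned in the dialogue, as described above.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    clicked_finished: bool            # user ended the task with "finished"
    objectives_completed: bool        # all task objectives were actually met
    goal_mentioned_in_dialogue: bool  # goal entity appears in the transcript


def included_lenient(trial: Trial) -> bool:
    """Lenient criterion: the user clicked "finished", whatever the outcome."""
    return trial.clicked_finished


def included_strict(trial: Trial) -> bool:
    """Strict criterion: objectives met and verifiable from the transcript."""
    return trial.objectives_completed and trial.goal_mentioned_in_dialogue
```

For example, a trial in which the user landed and clicked "finished" without ever mentioning the red car counts under the lenient criterion but not under the strict one.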

We examined the time it took to complete tasks 
according to the strict criterion, excluding all other 
trials. The ANOVA was identical to the
previous one. It, too, revealed a main effect of
help condition ($F_1(1,3) = 15.438$, $p < .05$), a main
effect of task ($F_1(1,3) = 83.512$, $p < .01$), and a help
condition by task interaction ($F_1(1,3) = 20.335$,
$p < .05$). Again the effects were in the predicted direction.
Users who received help took less time to
complete tasks than those who did not (226.2 seconds
vs. 377.5 seconds), the first task took longer
to complete than the last one (379.9 seconds vs.
223.75 seconds), and the difference between the help and
no help conditions was more marked on the first
task than on the last one (190.4 seconds vs. 112.3
seconds). These results are shown in Figure 3.

[Plot: Strict Criterion Analysis]

Figure 3: Time to complete task under Strict Criterion
for completion

5 Conclusions 

We have shown that users benefit from having online
Targeted Help. Naive users who received
Targeted Help messages were less likely to give 
up and significantly faster to complete tasks than 
users who did not. Overall, those who did not 
receive help gave up on 39% of the trials, while 
those who received our Targeted Help only gave 
up on 6% of the trials. With respect to time, 
when we considered all trials in which the user 
indicated that the goal had been completed (regardless
of performance), those users who did not
receive our Targeted Help took 53% longer than 
those who did. Under stricter inclusion criteria, 
which required the users to explicitly mention the 
goal and accurately complete the task, the differ- 
ence was even more pronounced. Those users who 
did not receive help took 67.0% longer to com- 
plete the tasks than those who received our Tar- 
geted Help. In both help conditions, performance 
improved over the course of the experimental ses- 
sion. However, the advantage conferred by help 
merely diminished and did not disappear during 
the session. 


These findings are remarkable because they 
demonstrate that it is possible to construct ef- 
fective Targeted Help messages even from fairly 
low quality secondary recognition. Moreover, the 
study suggests that such an approach can improve 
the speed of training for naive users, and may re- 
sult in lasting improvements in the quality of their 
understanding. 

6 Future Work 

This work suggests many interesting directions for 
further research. One area of investigation is the 
contribution of various factors to the effectiveness
of the Targeted Help message, for example:

• What benefit is due to the online nature of the 
help? 

• What benefit is due to the information con- 
tent? 

• What is the relative contribution of the vari- 
ous parts of the Targeted Help message to the 
improvement in user performance?

- Is the diagnostic alone more or less ef- 
fective than the example alone? 

- How much does getting the backup recognizer
hypothesis help the user?

- What is the most effective combination 
of these components? 

Another interesting direction is to look at effec- 
tiveness across different types of applications. The 
fact that we found positive results in this domain 
and that Gorrell et al. (2002) also found a variant
of Targeted Help useful on a quite different do- 
main suggests that the approach could be generally 
useful for a variety of types of dialogue systems. 
We are currently looking at porting our Targeted 
Help agent to additional domains. 